{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 21 - k-means clustering\n",
    "\n",
    "We will use the labor market data set from Lab 20.  It is available [here](http://comet.lehman.cuny.edu/owen/teaching/mat328/Nov2019_labor_market_majors.csv)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "import scipy.cluster.hierarchy as shc\n",
    "\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "from sklearn.cluster import AgglomerativeClustering\n",
    "from sklearn.cluster import KMeans\n",
    "\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "from sklearn import datasets\n",
    "\n",
    "%matplotlib inline\n",
    "pd.set_option(\"display.max_columns\", None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clustering labor market data using k-means\n",
    "\n",
    "Let's load the labor market data into a dataframe called `labor`.  Remember to skip the first 13 rows and the last 3 rows, and to set the `Major` column as the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that your dataframe was created correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which two columns are not numerical types (integers or floats)?  Can you guess why?\n",
    "\n",
    "Remove the commas in the `Median Wage Early Career` and `Median Wage Mid-Career` columns, and convert them to type float."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that the conversion happened correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Like hierarchical clustering, the k-means algorithm relies on the distances between data points, so we need to scale the data in each column to be between 0 and 1."
   ]
  },
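  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of what scaling to the range 0 to 1 means, the sketch below applies the min-max formula `(x - min) / (max - min)` to a single column by hand. The wage values are made up for illustration and `min_max_scale` is a hypothetical helper, not part of any library; `MinMaxScaler` (imported above) applies the same formula to every column by default.\n",
    "\n",
    "```python\n",
    "def min_max_scale(values):\n",
    "    # Map the smallest value to 0 and the largest value to 1\n",
    "    lo, hi = min(values), max(values)\n",
    "    return [(v - lo) / (hi - lo) for v in values]\n",
    "\n",
    "# Made-up wages, not taken from the labor market data\n",
    "wages = [30000.0, 45000.0, 60000.0]\n",
    "scaled = min_max_scale(wages)\n",
    "print(scaled)\n",
    "```\n",
    "\n",
    "Without this step, a column with large values, such as a wage in dollars, would dominate the distances over columns with small values."
   ]
  },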
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Put the scaled data back into a dataframe, using the column and index names from the original data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labor_scaled = pd.DataFrame(labor_scaled, columns = labor.columns, index = labor.index)\n",
    "labor_scaled.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details> <summary>Answer:</summary>\n",
    "<code>labor_scaled = pd.DataFrame(labor_scaled, columns = labor.columns, index = labor.index)\n",
    "</code>\n",
    "</details>\n",
    "\n",
    "Let's run k-means clustering on the data.  As with other scikit-learn algorithms, we first create a `KMeans` object with the number of clusters, and then apply it to our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kmeans = KMeans(n_clusters=4, random_state=0)\n",
    "kmeans_clusters = kmeans.fit_predict(labor_scaled)\n",
    "kmeans_clusters"
   ]
  },
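  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To give a feel for what `fit_predict` does conceptually, here is a minimal sketch of the standard k-means iteration on made-up one-dimensional data. It is not scikit-learn's implementation, which picks its own starting centers (hence `random_state`) and works in any number of dimensions; `nearest`, `kmeans_1d`, and the starting centers are assumptions for illustration.\n",
    "\n",
    "```python\n",
    "def nearest(p, centers):\n",
    "    # Index of the center closest to point p\n",
    "    return min(range(len(centers)), key=lambda j: abs(p - centers[j]))\n",
    "\n",
    "def kmeans_1d(points, centers, n_iter=10):\n",
    "    centers = list(centers)  # copy so the caller's list is not modified\n",
    "    for _ in range(n_iter):\n",
    "        # Step 1: assign each point to its nearest center\n",
    "        labels = [nearest(p, centers) for p in points]\n",
    "        # Step 2: move each center to the mean of its assigned points\n",
    "        for j in range(len(centers)):\n",
    "            members = [p for p, lab in zip(points, labels) if lab == j]\n",
    "            if members:  # an empty cluster keeps its old center\n",
    "                centers[j] = sum(members) / len(members)\n",
    "    return labels, centers\n",
    "\n",
    "# Made-up 1-D data with two obvious groups\n",
    "points = [0.1, 0.2, 0.3, 0.8, 0.9]\n",
    "labels, centers = kmeans_1d(points, centers=[0.0, 1.0])\n",
    "print(labels, centers)\n",
    "```\n",
    "\n",
    "The returned labels play the same role as `kmeans_clusters` in the cell above: one cluster number per data point."
   ]
  },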
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Store these cluster assignments in the `labor` dataframe in a column called `kmeans_clusters`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details> <summary>Answer:</summary>\n",
    "<code>labor[\"kmeans_clusters\"] = kmeans_clusters\n",
    "</code>\n",
    "</details>\n",
    "\n",
    "Now use the hierarchical clustering from Lab 20 to also cluster the data into 4 groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Store these cluster assignments in the `labor` dataframe in a column called `tree_clusters`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display the dataframe `labor` and check that the two new columns were added correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compare the clusters by using a filter to display only the rows in the first k-means cluster.  How does this cluster differ from the hierarchical one?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now do the same thing for each of the other 3 k-means clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the confusion matrix for the k-means clusters and the tree clusters."
   ]
  },
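  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cluster numbers are arbitrary labels, so k-means cluster 0 need not correspond to tree cluster 0, and large counts may appear off the diagonal. The sketch below builds the same kind of table by hand on made-up labels to show how to read it; `confusion_matrix(a, b)` returns such counts with one row per label in `a` and one column per label in `b`.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "# Made-up labels for six items from two clustering methods\n",
    "kmeans_labels = [0, 0, 1, 1, 2, 2]\n",
    "tree_labels = [1, 1, 0, 0, 2, 0]\n",
    "\n",
    "# Count how often each (k-means label, tree label) pair occurs\n",
    "pair_counts = Counter(zip(kmeans_labels, tree_labels))\n",
    "\n",
    "# Row i, column j: number of items in k-means cluster i and tree cluster j\n",
    "n = 3\n",
    "matrix = [[pair_counts[(i, j)] for j in range(n)] for i in range(n)]\n",
    "for row in matrix:\n",
    "    print(row)\n",
    "```\n",
    "\n",
    "Here k-means cluster 0 matches tree cluster 1 exactly, while k-means cluster 2 is split between tree clusters 0 and 2."
   ]
  },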
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details> <summary>Answer:</summary>\n",
    "<code>confusion_matrix(labor[\"kmeans_clusters\"],labor[\"tree_clusters\"])\n",
    "</code>\n",
    "</details>\n",
    "\n",
    "### Clustering digit images\n",
    "\n",
    "This section roughly follows the example [here](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html).\n",
    "\n",
    "Scikit-learn contains images of hand-written digits.  Each digit is encoded as an 8 pixel x 8 pixel image, and each of the 8x8 = 64 pixels is stored as a number representing the darkness of that pixel.\n",
    "\n",
    "See [here](https://scipy-lectures.org/packages/scikit-learn/auto_examples/plot_digits_simple_classif.html) for example images of the digits.\n",
    "\n",
    "First we load the digits.  As with the other scikit-learn datasets, it is stored as a dictionary-like object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "digits = datasets.load_digits()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display the possible keys."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will just use the data directly as `digits.data` instead of making a dataframe from it (notice there is no list of features or column names here).\n",
    "\n",
    "Run the k-means clustering algorithm with 10 clusters (for digits 0-9) on this data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details> <summary>Answer:</summary>\n",
    "<code>kmeans = KMeans(n_clusters=10, random_state=0)\n",
    "clusters = kmeans.fit_predict(digits.data)\n",
    "</code>\n",
    "</details>\n",
    "\n",
    "We can get the centers of the clusters using `kmeans.cluster_centers_`.  Try it below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "kmeans.cluster_centers_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These centers are not easy to interpret as numbers, but we can plot them with the following code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create 10 sub-plots\n",
    "fig, ax = plt.subplots(2, 5, figsize=(8, 3))\n",
    "# Reshape the 10 centers (64 values each) into 10 images of 8x8 pixels\n",
    "centers = kmeans.cluster_centers_.reshape(10, 8, 8)\n",
    "# For each of the 10 centers, using one of the 10 subplots:\n",
    "for axi, center in zip(ax.flat, centers):\n",
    "    # hide the tick marks on the x and y axes of that subplot\n",
    "    axi.set(xticks=[], yticks=[])\n",
    "    # plot the image\n",
    "    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}